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Abstract 

^  Given  a  prescribed  order  b  which  to  btroduce  reroes,  and  constrabts  on  the  architecture 
it  is  shown  how  to  develop  a  parallel  QR  factorisation  based  on  fast  Givens’  rotations  for  a  rect¬ 
angular  array  of  processors,  suitable  to  VLSI  implementation.  Unlike  designs  based  on  standard 
Givens’  transformations,  the  present  one  requires  no  square  root  computations.  Assuming  each 


processor  performs  the  elementary  operations  (+,«,/),  less  than 


roeessors  can  achieve  the 


decomposition  of  a  ^-banded,  order  n  matrix  b  time  O(n). 

Application  is  made  to  a  variant  of  Bareiss’  G- Algorithm  for  the  solution  of  weighted  multiple 
linear  least  squares  problems.  Given  k  different  right  hand  side  vectors,  +  kw)  processors 
compute  the  factorisation  b  0(n  +  k)  steps. 
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1.  Introduction 

Several  recent  papers,  surveyed  in  (H183],  demonstrate  that  a  QR  decomposition  based  on 
Givens’  rotations  lends  itself  well  to  parallel  implementation  on  a  rectangular  grid  of  processors 
in  silicon.  In  [H183]  it  is  shown  how  less  than  0(vs)  processors  accomplish  the  decomposition  of 
an  order  n  matrix  with  bandwidth  w  in  time  en,  e  a  small  constant.  However,  all  extant  designs 
make  use  of  standard  Givens’  rotations.  Their  drawback  is  the  need  to  compute  a  square  root  and 
multiplications  for  each  element  to  be  removed.  Multiplication  and  even  more  so  square  root 
still  belong  to  the  most  expensive  operations,  in  terms  of  both  chip  area  and  time;  their  avoidance 
b  desirable. 

Fast  Givens’  rotations  are  the  remedy.  Rather  than  to  the  original  matrix  .A,  they  are  applied 
to  a  factored  form  A  =  DA,  D  being  a  nonsingular  diagonal  matrix.  The  square  root  is  obviated 
and  the  number  of  multiplications  reduced  by  fifty  percent. 

Fast  Givens’  rotations  (their  properties  are  summarised  in  (Ham74)),  as  they  occur  in  the 
context  of  QR  decomposition,  are  used  for 

•  updating  of  the  solution  of  linear  systems  in  unconstrained  optimisation  of  homogeneous  func¬ 
tions  (KK78], 

•  solution  of  linear  systems  in  the  revised  simplex  method  for  linear  programming  [HW79], 

•  solution  to  linear  least  squares  problems  lGen73], 

•  similarity  transformations  in  the  Jacobi  method,  reduction  to  Hessenberg  form  and  the  QR 
method  for  eigenvalue  computations  ]Rat82]. 

The  processor  grid  b  easily  extended  to  handle  solutions  of  linear  least  squares  problems  as 
in  [Gen73]  with  multiple  right  hand  sides.  An  interesting  extension  is  possible,  though.  Slightly 
modifying  the  processors  to  compute  Bareiss’  G-Transformations  of  order  2,  one  obtains  a  processor 
grid  for  the  solution  of  weighted  least  squares  problems. 

As  for  the  organisation  of  thb  paper,  a  brief  description  of  standard  and  fast  Givens’  rotations 
b  followed  by  some  comments  on  the  restrictions  placed  on  parallel  architectures  by  technology. 
Thereafter  the  QR  algorithm  is  presented,  interconnections  and  data  flow  of  the  processor  array 
are  determined  and  finally  applied  to  the  solution  of  weighted  linear  least  squares  problems.  The 
last  section  remarks  on  how  to  extend  the  design  to  lower  triangular  decompositions  and  matrices 
the  bandwidths  of  which  disagree  with  the  array  site. 

Householder’s  notation  will  be  used  throughout  the  paper. 


8.  Standard  and  Fast  Ghent’  Rotations 

The  Givens’  plane  rotation  is  a  computationally  stable  device  for  introducing  teros  into  a 
matrix,  and  it  will  be  illustrated  how  it  inserts  a  tero  in  the  (2,1)  entry  a  2  x  n  matrix,  n  >  1. 

The  standard  Givens’  rotation  [Wilk&5],  which  alters  the  matrix  proper,  is  a  2  x  2  transfor¬ 
mation 
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To  do  away  with  square  roots  and  a  major  portion  of  the  multiplications,  fast  Givens'  rotations 
[Gen73,  Ham74]  modify  the  matrix  in  factored  form 
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There  are  several  choices  (Harnm74]  for  <{  and  1'2  which  permit  a'lf  and  a#  to  be  computed  with 
only  two  multiplications;  among  them  two  which  •  when  properly  combined  -  permit  easy  control 
of  element  growth  : 
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S.  Parallel  Architecture 

The  following  paragraphs  will  illustrate  the  development  of  a  fast-Grvens-QR  method  for  a 
parallel  processor  device  to  be  implemented  in  silicon  (VLSI  for  instance).  Constraints  imposed  by 
technology  and  fabrication  demand  regular  (and  if  possible  planar)  processor  interconnections  as 
well  as  processor  communication  on  a  nearest  neighbour  bash.  The  obvious  choice  of  structure  b  a 
rectangular  array  of  synchronously  operating  processors.  Furthermore,  to  keep  the  chip  area  min* 
imal,  processors  should  perform  no  other  than  the  elementary  operations,  addition,  multiplication 
and  divsion. 

The  constraint  of  local  data  exchange  suggests  a  combination  of  pipelining  and  multiprocessing 
to  achieve  good  processor  utilisation  and  speed  up  in  computation  time.  Pipelining  b  efficient  for 
long  matrices,  which  often  possess  a  narrow  dense  band.  The  hardware  should  reflect  the  features 
of  the  matrix  :  a  processor  count  proportional  to  the  bandwidth,  not  the  order  of  the  matrix. 
Consequently,  I/O  occurs  by  codiagonals  rather  than  rows  or  columns. 

A  QR  method,  chosen  with  regard  to  the  preeeeding  thoughts,  b  presented  next;  it  will  be 
followed  by  a  derivation  of  data  flow  and  dbtribution  of  operations  on  the  processor  grid. 

t  Paralls]  QR  Decomposition 

The  QR  decomposition  of  a  matrix  A  determines  a  factorisation 

A  -  QR, 

into  an  upper  triangular  matrix  R  and  an  orthogonal  matrix  Q,  the  product  of  Givens  rotations. 
For  fast  Givens’  transformations  in  particular,  thb  takes  the  form 

DA  «  QD'R, 

given  that  A  ■  DA,  again  R  b  upper  triangular,  Q  orthogonal  and  D,  Df  are  nonsingular  diagonal 
matrices. 
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Matrices  to  be  considered  are  square,  of  order  n,  with  a  (presumably  narrow  and  dense)  band 
of  width 


vmp  +  q  +  l, 


q  £  0  being  the  number  of  subdiagonab  and  p  >  0  the  number  of  superdiagonals. 


(PI)  The  QR  factorisation  preserves  the  bandwidth  w  of  a  matrix  A  :  elimination  of  <**+<,,• 
causes  Ell-in 

Sameh  and  Kuck  [SK7SJ  proposed  a  simple  elimination  strategy  which,  subject  to  (P2)  below, 
adheres  to  regular  data  flow  and  local  communication  :  banded  matrices  are  reduced  to  triangular 
(banded)  form  through  annihilation  of  elements  by  subdiagonals,  starting  from  without  towards 
the  main  diagonal,  while  proceeding  from  top  to  bottom  within  a  subdiagonal.  Formally,  if  otf+t,! 
is  removed  at  time  t  *  1  then  ot_*+t-,,-  b  removed  at  t  «  Jb  + 1,  0  <  t  <  q  -  1,  l£i£n-q  +  i. 
In  the  example  below  for  q  ■  2,p  «  3,n  «  6,  matrix  entry  (A  +  s',  s*)  contains  the  time  of  removal 
of  (subdiagonal)  element  (A  +  i,t). 


xxx* 

2  x  x  s  * 

1  3  x  *  x  x 

2  i  x  x  x 

3  5  x  x 
4  6  * 


(P2)  Elements  are  removed  by  rotations  in  adjacent  planes  :  elimination  of  a tales  place 
by  rotating  planes  A  + »  -  1  and  k  + «'. 

(P3)  Each  raw,  excepting  the  first  and  the  last,  b  modified  by  two  successive  rotations. 

To  keep  indexing  in  the  algorithms  consistent,  it  b  presumed  that  rows  affected  by  less  than  two 
rotations  participate  in  identity  transformations.  Under  the  temporary  assumption  thmt  pivoting 
b  unnecessary,  the  above  strategy  leads  to  the  following  (still  sequential)  algorithm,  implementing 
ease  (a).  For  each  subdiagonal  k,  q  2;  k  J;  1,  the  auxiliary  variable  As  represents  intermediate 
values  of  diagonal  elements  between  two  rotations  (and  so  do  matrix  elements  with  fractional 
superscripts  -  remember  (P3));  in  the  hardware  implementation  it  will  correspond  to  a  register. 
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Hence, 

£,(*-!)  „  (?(*)£>(*)A(*)(  q>  k>l, 

g(*)  *  ...  Q  (*•»), 

where  stands  for  the  ‘proper  embedding’  of  in  the  n  x  n  identity  matrix. 


S.  Processor  Vector 

All  iterations  of  the  innermost  /'loop,  expressed  as  equations  [A  +  •',»  +  j],  could  be  executed 
in  parallel,  if  were  to  be  globally  broadcast  to  at  least  w  processors.  As  this  is  to  be 
shunned,  the  iterations  are  performed  successively  b  different  processors  through  which  p(*>0 
is  pipeUned.  Each  eodiagonal  is  bput  to  a  different  processor,  starting  from  the  left  with  the 
outermost  subdiagonal  entering  processor  1.  Thereby  due  to  (Pi),  a  vector  of  w  processors  suffices 
to  elimbate  the  outermost  subdiagonal  b  2n  steps  (a  step  is  defined  as  the  computation  time  of 
the  slowest  processor). 

Consider  the  example  9*2  during  the  first  pass  through  the  Hoop  (fc  *  q  «  2).  The  compu¬ 
tation  of  D  will  be  disregarded  for  a  moment.  Sbee  at  least  w  processors  are  available  let  processor 
f  compute  equation  (2  +  •’,•'+/].  Thus,  for  processor  1  to  remove  o^J  by  generatbg  pfc*)  at  time 

I  *  A,  it  must  also  contain  a^,  see  Figure  1.  b  the  next  time  step  A  +  1,  />&*)  is  ready  to  be 


S 


input  into  processor  2,  together  with  and  Processor  2  then  computes  equation  [2,2), 
resulting  in  aQ  and  aj'^.  According  to  (P3),  a^’,  has  to  participate  in  a  second  rotation  for 
the  removal  of  o^'j.  Consequently,  o[*2,  available  at  t  =  A  +  2,  has  to  enter  processor  1  together 
with  for  the  determination  of  P^. 

It  b  now  possible  to  construct  the  processor  vector,  depicted  in  Figure  2.  After  having  entered 
a  processor  from  below  to  participate  in  the  first  rotation,  a  matrix  element  enters  the  left  neighbour 
for  the  second  rotation,  and  thereafter  exits  the  top  of  that  processor.  Moreover,  input  to  and 
computation  inside  a  processor  occur  only  every  other  time  step;  hence,  successive  eodiagona] 
elements  are  separated  by  one  time  unit. 

In  detail,  assuming  input  to  the  processor  vector  commences  at  t  *=  1,  then  a  sub-  or  maindi- 
agonal  element  b  input  to  the  vector  through  processor  q—k+l  at  t  *  k+2i—  1;  it  b  output 
from  the  vector  as  o] via  processor  q  -  Jfc  at  t  «*  4  +  2«‘,  0  <  k  <  q.  Similarly,  a  superdiagona] 

element  ajj^-  enters  processor  q-fi+1  atl«i  +  2i-l,  while  leaves  processor  q  -f  k  at 

t  «  k  +  2». 

The  computation  of  D*  remains  to  be  dbcussed.  Since  and  A*  depend  on  the  same  values 
as  PM,  they  can  also  be  determined  in  processor  1,  but  do  not  have  to  be  broadcast.  A;  now 
denotes  a  register  in  processor  1;  at  the  cutset  it  b  initialised  with  (jt  is  not  affected  by  the 
removal  of  subdiagcnal  2).  The  contents  of  A*  are  determined  during  the  computation  of  pM 
and  kept  there  till  needed  for  the  formation  of  D  enters  processor  1  along  with  the  2nd 

subdiagona!  (f^,. .  and  ,  are  input  at  tbe  same  time)  and  leaves  with  the,  new  outermost,  1st 
subdiagonal  (ffi,.  and  are  output  together).  Note,  that  during  the  annihilation  of  subdiagonal 
k,  row*  1 ...  k  -  1  of  AW  and  D remain  unaffected. 

0.  Processor* 

Only  two  different  kinds  of  processors,  sketched  in  Figure  3(a)  and  (b),  will  be  employed  for 
the  triangularisation.  Tbe  leftmost,  first,  processor  in  a  vector  computes  equations  [it  +  *,«],  while 
to  its  right,  processor  /  determines  [i  4-  i,i  +  /}.  When  idle  or  no  input  b  available,  processor  1 
generates  identity  rotations.  The  default  input  for  matrix  elements  b  zero. 

7.  Processor  Amy 

For  the  annihilation  of  q  subdiagonals,  either  a  single  processor  vector  b  used  repeatedly 
(before  the  (q  -  k  +  l)st  pass  register  A,  in  processor  1  b  initialised  with  St,  q  >  k  >  l)  or  else  a 
q  x  w  array,  consbting  of  q  ‘stacked*  processor  vectors,  performs  the  reduction  at  once. 

In  tbe  latter  case,  counting  from  top  to  bottom,  vector  k  is  responsible  for  eliminating  subdi¬ 
agonal  k.  A  and  D  are  input  to  the  bottom  of  the  array,  while  R  and  Df  are  available  at  the  top. 
The  matrix  Q  leaves  the  right  aide  in  factored  form.  Thus,  denoting  the  /fcb  processor  from  the 
left  in  the  ith  vector  by  (k,j),  processor  (t,  I)  computes  equations  [k  +  *,*)  and  processor  (k,j) 
equations  [i + 1,  i+/J.  Figure  4  illustrates  the  data  Row  for  a  2  x  w  array  when  q—  2.  At  tbe  outset 
of  the  computation,  regbter  A*  in  processor  (k,  1)  contains  #*.  The  computation  time  comes  to 
2(n  —  1  +  q). 

S.  Weighted  MnltlpU  Linaar  Least  Square*  Problem* 

With  minor  modifications  the  processor  array  b  adjusted  to  solve  multiple  weighted  linear 
least  squares  problems  for  tn  x  n  matrices  A  with  m  >  n  and  rank(A)  =  n, 

(Ax  -  Si)tW(Ax  -  *,)  «  lD(Ax  -  St)h  m  min,  D*W*,  \<l<h. 
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The  discussion  will  omit  tlx  backsubstitutioo  tod  focus  on  tlx  tritofultr  decomposition. 

The  G  -  Transform  ations  of  Barcbs  for  weighted  lixxtr  letst  squires  problems  [Btr82]  ire,  like 
Householder  Transform  at iocs ,  intended  to  remove  til  elements  in  t  column. 

Let 

B<°»  -  [AM 

and  <J<*>  an  orthogonal  transformation  that  eliminates  the  last  n-k  elements  in  column  k  of  DA, 
then 

.  q(*+Ub<*),  0  <  k  <  n  -  1, 

and 

With  nonsingultr  diagonal  matrices  D^, 

l<k<n-l, 

so  that 

[yl(*+.)i *{*•»)]  „  I){*+1)C(H1)  f  o  <  t  S  »  -  1, 

The  underling  idea  b  to  update  instead  of  £>W  and  thus  obriate  square  roots. 

It  turns  out,  however,  that  matrices  of  order  2  are  equal  to  standard  Givens’  rotations. 
Applying  2x2  matrices  <7^  to  band  matrices  in  the  order  prescribed  by  Sam  eh  and  Kuck  [SK78J, 
one  obtains  a  method  of  the  same  structure  as  before.  The  2x2  transformations  are  determined 
as  follows.  Let 
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where  <J  is  a  standard  Givens’  rotation 
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Note,  that 


*n  ■  !• 

Accordingly,  the  complete  decomposition  algorithm  is  written  sa 


m  W,  •  A; 


for  l  ml. ..It, 

for  kmq...  1,  l 
for  kmq...  if 
for  I  *  1...H  -  k, 

{determine  G  to  remove  itk  element  of  ktk  oakdiogonol } 
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{ update  ith  element  of  (k  —  l)#t  ovkdiagonal) 
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{ update  diagonal  matrix} 
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for  l  m  l...k, 

{apply  G  to  [k  +  i-  l)sl  ond  [k  +  i')tk  element  of  Itk  right  hand  aide  vector ) 
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»)--(&)' 
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Again,  if  the  defining  elements  of  the  transform stion  are  tero  precautions  similar  to  the  pre- 
rious  ones  have  to  be  taken. 

The  interconnections  and  the  data  flow  of  the  processor  array  remain,  only  the  processors 
hare  to  be  slightly  reprogrammed.  In  addition,  k  processor  columns  of  q  processors  are  appended 
to  the  left  of  the  array  for  the  computation  of  equations  [i,  k  + i],  resulting  in  a  total  of  q(w  +  k) 
processors.  Now,  processor  (A,  1)  transmits  G^  to  its  left  and  right,  so  (t +  i,i+y]  and  (/,4  +  ij 
are  computed  concurrently.  The  fcth  processor  (from  top)  in  the  fth  column  (from  the  right)  has  a 
register  A{,t  which,  like  At,  enforces  (P2);  compare  Figure  3(c).  If  the  entry  equal  to  one  in 
is  implicitly  assumed,  the  transformation  can  he  represented  and  transmitted  by  three  cumbers. 
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The  processor  array  depicted  in  Figure  4.  In  general,  0/,i  enters  q-l 

steps  before  «*u;  and  succeeding  elements  of  1/  enter  every  two  steps.  0$  b  output  j  steps  after 
Starting  the  input  with  fat  at  f  »  1,  the  computation  time  comes  to  2m  +  Z(q  - 1)  +  A  steps 


0.  Remarks 

The  time  to  for  the  QR  factorisation  of  a  rectangular  m  x  n  matrix  in  q  x  w  array  ‘ 
2(max{m,n}  +7  —  1).  Hie  factorisation  of  full  d-nae  matrices  requires  m  - 1  processor  vectors 
length  m  +  n  —  1  and  time  2(max{m,  n}  +  m  -  1);  hence  b  less  efficient. 

A  lower  tringularj  QL  decomposition,  b  obtained  by  reversing  the  direction  of  the  horizontal 
interconnections  and  computing  equations  [£,•)  in  the  rightmost  processor,  as  in  [HI83]  for  standard 
Givens’  rotations. 

Matrices,  whose  bandwidth  exceeds  the  length  of  a  processor  vector,  must  be  partitioned  into 
smaller  submatrices  of  suitable  size.  They  are  then  separately  ,  in  proper  order,  input  to  the 
array.  ‘Recycling’  of  already  computed  rotations  might  be  necessary.  Matrices  with  too  small  a 
bandwidth  are  input  left  bound. 
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2.  8 


Processors  (i,l)  are  initialised  with  w*,  1  <  k  <  2. 


Figure  1:  Processor  Array  for  the  Weighted  Multiple  linear 
Least  Squares  Problem,  q  «  p  ■*  k  «  2. 
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to  the  left  of  the  array  for  the  computation  of  equations  [f,fc  + i),  resulting  in  a  total  of  <j(w  +  L) 
processors.  Now,  processor  (k,  1)  transmits  G^  to  its  left  and  right  so  [fc  +  •,•  +  /]  and  [f,  k  +  i] 
are  computed  concurrently.  The  kth  processor  (from  top)  in  the  fth  column  (from  the  right)  has  a 
register  Ai,*  which,  like  As,  enforces  (P2);  compare  Figure  3(c).  If  the  entry  equal  to  one  in 
is  implicitly  assumed,  the  transformation  can  be  represented  and  transmitted  by  three  numbers. 
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